Adds support for Cholla's new Concatenation Format by mabruzzo · Pull Request #5170 · yt-project/yt

mabruzzo · 2025-05-21T18:26:25Z

PR Summary

This PR adds support for loading Cholla datasets that are:

distributed over multiple files
stored in Cholla's historical concatenation datasets

For context:

when Cholla writes out snapshots, it writes each block to a separate file.
The Cholla repository provide python scripts to concatenate data, which has historically concatenate all of the data so that the resulting file would appear like there was only 1 process that evolved a single gigantic block. These scripts were commonly used to simplify data analysis.
Until this PR, the yt-frontend has only known how to load datasets after they were concatenated in this manner. This introduced challenges because:
1. the yt-frontend must load everything at once (introducing obvious memory issues)
2. in cases where the number of cells exceeds 2^31 yt would simply break (this came up on large-memory nodes)

This PR introduces support for reading Cholla datasets that are either:

distributed (i.e. they are not concatenated), OR
use Cholla's newly added concatenation format.
- I introduced the new concatenation format with the intent of improving performance when using yt (Introduce new organizational schema format for storing Cholla data cholla-hydro/cholla#427). Essentially the new format organizes data in a manner based on the underlying block structure (in a similar manner to what an AMR simulation might do). This let's us reuse a lot of the machinery for reading distributed datasets
- It's important that the yt-frontend can read in this format because (i) it provides some performance advantages (on parallel filesystems) and (ii) we have scripts to repackage datasets using the old concatenation format so that they now have this new format.

Aside: This PR superseded PR #4702

PR Checklist

New features are documented, with docstrings and narrative docs

Co-authored-by: Matthew Turk <matthewturk@gmail.com>

Co-authored-by: Clément Robert <cr52@protonmail.com>

mabruzzo · 2025-08-26T17:07:16Z

I lightly refactored a little code

I realized that _split_fname_procid_suffix was no longer used for anything. I think it's nice to have a function that explicitly describes the relationship between the filename passed into yt.load and the filename template. Consequently I replaced this function with a new function called `_infer_fname_template` that we call from `_determine_data_layout` (the contents didn't change dramatically)

mabruzzo · 2025-10-15T10:48:17Z

@neutrinoceros, @matthewturk -- I just wanted to ping you to see if there is anything I can do to make this PR easier to review. I would love to see this get merged in time for the 4.5 release (whenever that is)

-> this testing strategy creates synthetic data and performs a round-trip test to confirm that the data is loaded properly

This is in anticipation of a followup PR adding support for particles

matthewturk · 2025-10-31T19:24:43Z

@mabruzzo I am so sorry for not seeing and responding sooner.

I have reviewed this and it looks really good. I'm OK with either doing another pass or putting it in. Thank you for your hard work on it.

mabruzzo · 2025-10-31T19:43:01Z

@matthewturk: Not a problem. Thanks for looking it over.

I just pushed 4 commits today because I had started to work on a follow-up PR (that adds Particle support) and I was trying to minimize how much the followup PR would change things. The 3 new commits both:

reorganize some of the code and factor out an idiom I was using multiple times (related to opening and closing hdf5 files)
added some new unit tests to actually test a bunch of functionality (it generates synthetic datasets in different formats and then tries to load them)

Unfortunately, these tests aren't actually passing on CI (Frankly, I don't know why -- they work when I run pytest yt/frontends/cholla/tests/test_load.py on my local machine...) If you can spot an obvious mistake that I'm making, I'm happy to try to fix it. Otherwise, I'm happy to move these 4 commits to a separate PR

mabruzzo · 2025-10-31T19:49:49Z

Ok, I think the problem is that my new test relies upon pytest fixtures (and other machinery) and I need to somehow exclude the test from the nose test-runner

mabruzzo · 2025-10-31T19:53:14Z

Is there an easy way to exclude the test from the nose test runner (but not the pytest runner?)

neutrinoceros · 2025-10-31T20:11:42Z

Yes. Add it to nose_ignores.txt

mabruzzo and others added 10 commits October 16, 2023 12:59

Add support for distributed cholla datasets.

4428058

Update yt/frontends/cholla/data_structures.py

5b51057

Co-authored-by: Matthew Turk <matthewturk@gmail.com>

Merge branch 'main' into cholla-frontend-improvements

ae37959

improve the docstring for _split_fname_procid_suffix

448b483

Merge branch 'main' into cholla-frontend-improvements

a51fe81

add support for new cholla concatenation format

d5fa078

Merge branch 'main' into cholla-frontend-newformat

b21a792

try to satisfy mypy

250dbd9

fix the cholla frontend

f5e92f0

finally addressing the mypy type-checking issues

a26cbff

neutrinoceros added enhancement Making something better code frontends Things related to specific frontends frontend: cholla labels Jun 26, 2025